In this project I analysed a dataset regarding deaths in Milan. The dataset contains, for each day between 01/01/1980 and 31/12/1989:

- tot.mort (total number of deaths);
- resp.mort (number of deaths from respiratory causes).

For each day we also have some explanatory variables:

- mean.temp (mean temperature);
- rel.humid (relative humidity);
- SO2 (sulphur dioxide concentration);
- TSP (total suspended particles).

The dataset is described in Vigotti, M.A., Rossi, G., Bisanti, L., Zanobetti, A. and Schwartz, J. (1996), "Short term effect of urban air pollution on respiratory health in Milan, Italy, 1980-1989", Journal of Epidemiology and Community Health, 50, 71-75, and it has been analysed in Ruppert, D., Wand, M.P. and Carroll, R.J. (2003), Semiparametric Regression, Cambridge University Press.
The aim of the project is to build predictive models for tot.mort and resp.mort and to understand which variables most affect the probability of death.
The following plots show the distributions of tot.mort and resp.mort. As we can see, the values of tot.mort are large enough to model it as a continuous variable, while the values of resp.mort are much lower.
The following plot shows the evolution of the Milan population according to the ISTAT censuses. The data can be found here. As we can see, the population decreased between 1980 and 1990. To take into account the decreasing number of people exposed to the risk, I computed the population day by day by linearly interpolating between the census dates, and used it as an exposure in the models. With this information I computed the mortality rate as: \[ Mortality\ Rate = \frac{tot.mort}{population} \] Note that this is a crude mortality rate, not an age-specific one: it does not take the age distribution of the population into account. Since the Milan population aged over the period 1980-1990, the crude mortality rate would have increased even if mortality at each age had remained unchanged.
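The daily interpolation described above can be sketched with base R's `approx()`. The census dates and population counts below are illustrative placeholders, not the actual ISTAT figures:

```r
# Linear interpolation of census population to daily values.
# Census dates and counts are hypothetical placeholders.
census_dates <- as.Date(c("1981-10-25", "1991-10-20"))
census_pop   <- c(1600000, 1370000)

days <- seq(as.Date("1980-01-01"), as.Date("1989-12-31"), by = "day")

# rule = 2 carries the nearest census value beyond the census range,
# instead of returning NA for days outside it.
pop_daily <- approx(x    = as.numeric(census_dates),
                    y    = census_pop,
                    xout = as.numeric(days),
                    rule = 2)$y

# Crude mortality rate, e.g.:
# mort$tot_mort_prob <- mort$tot.mort / pop_daily
```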
The following plots show the distributions of the explanatory variables. As we can see, SO2 is strongly skewed. To deal with this, I applied a logarithmic transformation:
SO2_log = log(SO2 + 30)
The following plots show the trend of the mortality rate and the respiratory mortality rate over time. There is a strong seasonal effect, with higher mortality in winter and lower mortality in summer. We can spot some outliers in summer 1983 and in the first quarter of 1986: in those periods mortality was much higher than in the same periods of other years.
In the following plot we can compare the within-year mortality patterns across different years. As we can see, they are quite similar.
The following plots show the trend of the explanatory variables over time. All these variables depend strongly on time, which means that a strong correlation between them and mortality could be spurious. For example, during winter both mortality and pollutant concentrations increase while the temperature drops, so pollution and mortality would appear correlated even in the absence of any causal link.
To deal with this problem, I included the period of the year in the model through a spline on the day of the year.
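A seasonal term of this kind could be specified in mgcv, for instance with a cyclic basis; this is a sketch, not the exact specification used below (the models below use the default thin-plate basis):

```r
library(mgcv)

# A cyclic cubic spline (bs = "cc") forces the seasonal effect to join
# smoothly between day 366 and day 1, a natural constraint for day-of-year.
fit_seasonal <- gam(tot_mort_prob ~ s(day.of.year, bs = "cc"),
                    data  = mort,
                    knots = list(day.of.year = c(1, 366)))
```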
Looking at the mean.temp trend, we can see that summer 1983 was hotter than the other summers. That could partially explain the higher mortality of that period.
The following plot shows the relationships between the variables. Many of them are not linear, so a GLM could perform poorly and a GAM could be more suitable. In particular, from the panel relating mean temperature and mortality rate (mean.temp, tot_mort_prob), we can see that the mortality rate is higher when it is cold and decreases with warmer temperatures, but rises rapidly again at very high temperatures.
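As a quick illustration of why a smooth term helps with such a U-shaped relationship, one could compare a linear and a smooth temperature effect (a simplified sketch using the dataset's column names, not one of the models fitted below):

```r
library(mgcv)

# Linear vs. smooth effect of temperature on the mortality rate.
fit_lin    <- gam(tot_mort_prob ~ mean.temp,    family = gaussian, data = mort)
fit_smooth <- gam(tot_mort_prob ~ s(mean.temp), family = gaussian, data = mort)

# With a U-shaped relationship, the smooth model should achieve a
# noticeably lower AIC despite its extra effective degrees of freedom.
AIC(fit_lin, fit_smooth)
```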
The best predictive model I found for the mortality rate, without taking time into account, is the following:
\[ \begin{align} \frac{Y_i}{pop_i} &= \beta_0 + s(temp_i, humid_i) + \beta_1 SO2^{log}_i + s(TSP_i) + \epsilon_i \\ \epsilon_i &\sim \mathcal{N}(0,\sigma^2) \end{align} \] where:
The criterion for choosing it was the AIC. To speed up the process, I used classical GAM models rather than Bayesian ones.
The estimates are shown in the following chunk:
library(mgcv)

fit_gam_noday <- gam(tot_mort_prob ~ s(mean.temp, rel.humid) + SO2_log + s(TSP),
                     family = gaussian,
                     data = mort)
summary(fit_gam_noday)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## tot_mort_prob ~ s(mean.temp, rel.humid) + SO2_log + s(TSP)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9992 0.8702 16.087 < 2e-16 ***
## SO2_log 1.4514 0.1844 7.872 4.56e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(mean.temp,rel.humid) 25.792 28.383 10.275 < 2e-16 ***
## s(TSP) 3.588 4.511 5.981 4.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.259 Deviance explained = 26.5%
## GCV = 17.714 Scale est. = 17.561 n = 3652
par(mar = c(4, 4, 4, 3))
layout(matrix(c(1, 1, 2, 3), ncol = 2))
plot(fit_gam_noday,
     shade = TRUE, scheme = 2,
     residuals = TRUE, all.terms = TRUE)
As previously observed, we cannot tell whether the estimated effect of the explanatory variables on mortality reflects a causal relationship or a spurious correlation.
The best predictive model I found for the mortality rate is the following:
\[ \begin{align} \frac{Y_i}{pop_i} &= \beta_0 + s(temp_i) + s(doy_i) + \beta_1 year_{1i} + \dots + \beta_J year_{Ji} + \beta_{J+1} weekend_i + \epsilon_i \\ \epsilon_i &\sim \mathcal{N}(0,\sigma^2) \end{align} \] where:
To estimate the parameters I used a Bayesian approach with the non-informative prior \(\pi(\beta)\propto k\), \(k\in]0,+\infty[\).
To perform the variable selection I tried several GAMs with classical (frequentist) parameter estimation, because a Bayesian estimation would have taken too much time and, with non-informative priors, the estimated values are quite similar.
library(rstanarm)

SEED   <- 1234
CHAINS <- 4
CORES  <- 4
ITER   <- 500

fit_gam_bayes_best <- stan_gamm4(tot_mort_prob ~ s(mean.temp) + s(day.of.year) +
                                   factor(year) + factor(weekend),
                                 family = gaussian, data = mort,
                                 chains = CHAINS, iter = ITER,
                                 seed = SEED, cores = CORES)
| coeff | 2.5% | mean | 97.5% | signif |
|---|---|---|---|---|
| (Intercept) | 21.34 | 21.74 | 22.15 | *** |
| factor(year)1 | -0.91 | -0.34 | 0.19 | |
| factor(year)2 | -1.62 | -1.08 | -0.52 | *** |
| factor(year)3 | -0.30 | 0.27 | 0.86 | |
| factor(year)4 | -1.65 | -1.11 | -0.56 | *** |
| factor(year)5 | -1.22 | -0.67 | -0.11 | * |
| factor(year)6 | -0.96 | -0.39 | 0.20 | |
| factor(year)7 | -2.07 | -1.54 | -1.03 | *** |
| factor(year)8 | -2.13 | -1.57 | -0.99 | *** |
| factor(year)9 | -1.87 | -1.29 | -0.72 | *** |
| factor(year)10 | -5.33 | -0.70 | 3.94 | |
| factor(weekend)TRUE | -0.77 | -0.49 | -0.19 | *** |
| s(mean.temp).1 | -25.18 | -6.38 | 11.47 | |
| s(mean.temp).2 | -6.78 | 6.63 | 20.97 | |
| s(mean.temp).3 | 2.06 | 16.64 | 31.80 | * |
| s(mean.temp).4 | -18.00 | -7.38 | 2.39 | |
| s(mean.temp).5 | 3.57 | 14.12 | 24.25 | * |
| s(mean.temp).6 | -15.52 | -10.49 | -5.51 | *** |
| s(mean.temp).7 | 13.80 | 21.78 | 29.92 | *** |
| s(mean.temp).8 | -39.20 | -26.88 | -15.69 | *** |
| s(mean.temp).9 | 26.68 | 59.51 | 91.63 | ** |
| s(day.of.year).1 | -22.58 | -1.95 | 18.33 | |
| s(day.of.year).2 | -21.22 | -7.96 | 5.70 | |
| s(day.of.year).3 | -69.97 | -54.99 | -40.88 | *** |
| s(day.of.year).4 | 2.11 | 11.04 | 19.49 | * |
| s(day.of.year).5 | 21.54 | 29.20 | 37.19 | *** |
| s(day.of.year).6 | 14.49 | 19.34 | 23.80 | *** |
| s(day.of.year).7 | 8.34 | 12.00 | 16.22 | *** |
| s(day.of.year).8 | -9.02 | -2.74 | 3.59 | |
| s(day.of.year).9 | 0.01 | 16.80 | 37.37 | * |
| sigma | 3.90 | 3.99 | 4.07 | *** |
| smooth_sd[s(mean.temp)1] | 8.50 | 14.97 | 24.23 | *** |
| smooth_sd[s(mean.temp)2] | 11.37 | 24.99 | 42.00 | *** |
| smooth_sd[s(day.of.year)1] | 14.52 | 20.66 | 29.61 | *** |
| smooth_sd[s(day.of.year)2] | 1.34 | 11.23 | 25.36 | *** |
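A table like the one above can be assembled from the posterior draws, for instance as follows (a sketch assuming the rstanarm fit object from the chunk above):

```r
library(rstanarm)

# 95% posterior credible intervals and posterior means for all parameters.
ci    <- posterior_interval(fit_gam_bayes_best, prob = 0.95)
means <- colMeans(as.matrix(fit_gam_bayes_best))

coef_table <- data.frame(`2.5%`  = ci[, 1],
                         mean    = means[rownames(ci)],
                         `97.5%` = ci[, 2],
                         check.names = FALSE)

# A coefficient can be flagged as "significant" when its credible
# interval excludes zero.
coef_table$signif <- sign(ci[, 1]) == sign(ci[, 2])
```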
The following plot shows the marginal effects of day.of.year and mean.temp.
I also investigated whether it is better to consider the interaction between day.of.year and mean.temp in the spline component of the model (i.e. s(mean.temp) + s(day.of.year) versus s(mean.temp, day.of.year)).
fit_gam_bayes_int <- stan_gamm4(tot_mort_prob ~ s(day.of.year, mean.temp) + factor(year) + factor(weekend),
family = gaussian, data = mort,
chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)
Plotting the joint effect of mean.temp and day.of.year for the two models, we can see that they are quite similar.
To choose between the two, I used the Leave-One-Out Information Criterion (LOO-IC). According to it, the model without the interaction performs better.
loo_best <- loo(fit_gam_bayes_best)
loo_int <- loo(fit_gam_bayes_int)
loo_best
##
## Computed from 1000 by 3652 log-likelihood matrix
##
## Estimate SE
## elpd_loo -10241.1 46.7
## p_loo 32.4 1.7
## looic 20482.2 93.4
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 3650 99.9% 370
## (0.5, 0.7] (ok) 2 0.1% 313
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
loo_int
##
## Computed from 1000 by 3652 log-likelihood matrix
##
## Estimate SE
## elpd_loo -10266.5 49.2
## p_loo 39.1 1.3
## looic 20532.9 98.3
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 3650 99.9% 439
## (0.5, 0.7] (ok) 2 0.1% 510
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
compare(loo_best,
loo_int)
## elpd_diff se
## -25.4 12.6
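Note that `compare()` has since been deprecated in the loo package; the same comparison can be obtained with `loo_compare()` (shown here as an equivalent call, not part of the original analysis):

```r
library(loo)

# loo_compare() reports each model's elpd difference relative to the
# best model, together with its standard error.
loo_compare(loo_best, loo_int)
```

An elpd difference about twice its standard error, as here, is usually taken as a meaningful gap in expected predictive accuracy.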
The following plots check whether the model assumptions hold in the data.
From the overlay plot and the statistics computed on the whole dataset the model looks fine.
Considering the statistics computed for each year, something strange appears. Looking at the plots for the standard deviation, we can see that in 1983 and 1986 the standard deviation in the data is much higher than expected, while in other years, such as 1982, 1987 and 1988, it is lower than expected. This means that the model does not properly describe the within-year variance, and the homoscedasticity assumption is not verified.
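This kind of check can be reproduced with a grouped posterior predictive statistic; a sketch assuming the bayesplot package and the fit from above:

```r
library(rstanarm)
library(bayesplot)

# Draws from the posterior predictive distribution.
yrep <- posterior_predict(fit_gam_bayes_best)

# Compare the observed per-year standard deviation with its posterior
# predictive distribution; years where the observed sd falls in the
# tails indicate that the constant-variance assumption is violated.
ppc_stat_grouped(y     = mort$tot_mort_prob,
                 yrep  = yrep,
                 group = mort$year,
                 stat  = "sd")
```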
From the following plot we can see that in the years 1983 and 1986 there are some outliers with a particularly high mortality. In 1983 this phenomenon happened during summer, while in 1986 it happened during the first quarter.
To model the number of deaths from respiratory causes, I used a Poisson distribution.
The best predictive model I found for the respiratory mortality rate is the following:
\[ \begin{align} Y^{resp}_i &\sim \text{Poisson}(\lambda_i) \\ \frac{\lambda_i}{pop_i} &= e^{\beta_0 + s(temp_i) + s(doy_i) + \beta_1 year_{1i} + \dots + \beta_J year_{Ji} + \beta_{J+1} SO2^{log}_i} \end{align} \] where:
fit_resp_gam_bayes_best <- stan_gamm4(resp.mort ~ offset(log(pop_m)) + s(mean.temp) + SO2_log + s(day.of.year) + factor(year),
family = poisson, data = mort,
chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)
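Once fitted, this model can generate posterior predictions on the count scale; a sketch assuming the rstanarm fit object from the chunk above:

```r
library(rstanarm)

# Draws from the posterior predictive distribution of the daily number
# of respiratory deaths; each row is one simulated dataset.
yrep_resp <- posterior_predict(fit_resp_gam_bayes_best)

# Posterior predictive mean for each day, to compare with resp.mort.
pred_mean <- colMeans(yrep_resp)
```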